
Journal of Computational Chemistry

Wiley

Preprints posted in the last 30 days, ranked by how well they match Journal of Computational Chemistry's content profile, based on 11 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit.

1
Benchmarking generative AI and physics based molecular simulation for sampling conformational heterogeneity in T4 Lysozyme

Bhakat, S.

2026-05-13 biophysics 10.64898/2026.05.10.724101 medRxiv
Top 0.1%
2.0%

Wild-type T4 lysozyme (T4L) is used as a benchmark to evaluate conformational sampling across generative AI, AI-accelerated molecular simulation (AMS), and physics-based enhanced molecular dynamics (EMD). A four-state model (exposed/open, exposed/closed, buried/open, and buried/closed) is defined using physically meaningful collective variables. While generative AI methods (AF-cluster, MSA subsampling of AlphaFold2, ConforFold, AlphaFlow, ESMFlow, ConfRover, BioEmu) largely sample only the exposed/open state, AMS, which integrates generative ensembles with iterative molecular dynamics, recovers all states and reproduces equilibrium populations similar to EMD and experimental smFRET signatures.

2
SuBMIT: A Software Toolkit for Facilitating Simulations of Coarse-Grained Structure-Based Models of Biomolecules.

Prakash, D. L.; Banerjee, A.; Gosavi, S.

2026-05-20 biophysics 10.64898/2026.05.18.725912 medRxiv
Top 0.1%
1.7%

Coarse-grained structure-based models (CG-SBMs; or Gō models) are simplified potential energy functions of biomolecules or biomolecular complexes that encode their structure. Molecular dynamics simulations of such SBMs have been successfully used to study long time-scale dynamics such as protein and RNA folding, and large conformational transitions of biomolecular complexes. SBMs have several advantages: (1) Their MD simulations are computationally inexpensive, making extensive sampling easily accessible to many researchers. (2) They are easy to modify and can be adapted for the specific biomolecular problem that needs to be investigated. However, the force-fields of SBMs are not usually included in commonly used biomolecular simulation packages resulting in a barrier to their use. Here, we present SuBMIT (Structure Based Models Input Toolkit; https://github.com/sglabncbs/submit), a toolkit for generating coarse-grained SBM input files for performing MD simulations with GROMACS and OpenMM/OpenSMOG. Simulations whose input files can be generated using the different flavors of CG-SBMs present in SuBMIT include the folding and conformational ensembles of proteins with intrinsically disordered regions, 3D-domain-swapping in proteins and the dynamics of RNA-protein assemblies (e.g., simple RNA viruses).

3
Bayesian-Steered Structure Prediction of Mechanical Biomolecules Using Twisted Diffusion

Klaus, C.; Sotomayor, M.

2026-05-13 bioinformatics 10.64898/2026.05.11.724187 medRxiv
Top 0.1%
1.7%

Deep learning approaches have revolutionized protein structure prediction. These tools are trained using experimental data and recapitulate reported conformations, but there is great interest in predicting conformations that may be functionally relevant although experimentally underrepresented. Since many modern structure prediction tools use generative artificial intelligence diffusion models, we reframe the search for alternative molecular conformations as that of sampling from a diffusion distribution conditioned using any arbitrary Bayesian likelihood. We implement a twisted diffusion sampler in Boltz-2 to sample this conditioned distribution and demonstrate the utility of this approach, which does not require any additional training of the neural network, by implementing a diffusion analog of steered molecular dynamics simulations applied to mechanical systems. We can reproduce predicted stretched states of fragments of DNA, the muscle protein titin, and the inner-ear protocadherin-15 protein, as well as open states of the MscL ion channel consistent with experimental results. We expect that steered structure predictions will help sample underrepresented and non-equilibrium conformations for many macromolecular systems.

4
CTGoMartini: A Python Framework for Simulating Biomolecular Conformational Transitions with Go-Martini Models

Yang, S.; Song, C.

2026-05-04 biophysics 10.64898/2026.04.30.721921 medRxiv
Top 0.1%
1.3%

Characterizing conformational transitions between distinct structural states is essential for understanding protein function but remains challenging due to the timescale limitations of atomistic molecular dynamics. While coarse-grained models like Martini accelerate sampling, classical elastic-network or Gō-like restraints often trap proteins in a single energy basin, precluding the study of transition pathways between distinct functional states. Here, we present CTGoMartini, a comprehensive Python package designed to simulate protein conformational transitions using Gō-Martini models in explicit membranes. CTGoMartini addresses key methodological limitations of existing approaches by redefining native contacts as a dedicated interaction type, thereby eliminating spurious protein aggregation artifacts in multi-copy simulations. The package implements both switching and multiple-basin approaches (Exponential and Hamiltonian mixing) to sample transitions between experimentally defined states. Furthermore, it integrates Hamiltonian replica exchange molecular dynamics (HREMD) with PyMBAR analysis, enabling efficient optimization of mixing parameters that govern barrier heights and relative state stabilities. We demonstrate the power of CTGoMartini through two biologically significant membrane protein systems: (1) capturing the inward-open to outward-open transition of the lipid transporter SPNS2, revealing the molecular mechanism of S1P translocation; and (2) elucidating how membrane surface tension and anionic lipids (POPA, PIP2) modulate the conformational equilibrium of the mechanosensitive ion channel TREK1. By streamlining model construction, simulation, and analysis, CTGoMartini offers an easy-to-use platform that connects static structural snapshots with their underlying dynamic functional mechanisms.

5
Design of DNA Aptamers for Lyme disease Diagnosis Combining experimental and numerical approaches

GAYRAUD, G.; Davila Felipe, M.; Padiolleau-Lefevre, S.; Maffucci, I.; Issouani, E. M.; Guerin, M.; Da Ponte, H.

2026-05-15 bioinformatics 10.64898/2026.05.13.724892 medRxiv
Top 0.2%
0.7%

Aptamers are single-stranded DNA or RNA molecules selected for their high affinity and specificity in binding target molecules, similar to antibodies. They are commonly selected through the SELEX process, which iteratively exposes a random sequence library to a target and retains the sequences showing good binding properties. To improve Lyme disease detection, we propose designing aptamers that specifically bind to the CspZ protein on the surface of Borrelia burgdorferi, the bacterium responsible for the disease. Starting from a thirteen-round SELEX process that yielded in vitro sequence candidates, we propose a holistic process that selects new sequence candidates in silico and then validates them experimentally. Our approach relies on 1) using Machine Learning (ML) techniques, specifically a Restricted Boltzmann Machine (RBM), to digitally replicate the last round of the SELEX process, 2) integrating insights from text analysis methods, such as word2vec and n-grams, into the RBM model trained on the final-round SELEX dataset to represent and compare newly generated sequences with in vitro candidates, 3) selecting in silico sequences with strong potential to bind to the CspZ protein, and 4) experimentally validating the sequences selected in step 3. Our holistic approach combines biological insights with statistical models to improve the efficiency and outcome of the SELEX process. We enhance the RBM model, designed to replicate the distribution of the final SELEX round, by integrating geometric representations of sequences, which is especially advantageous when dealing with limited datasets relative to the vast sequence space. In addition, the approach provides in silico sequence candidates with strong binding properties.

6
Fast and Ultra-Capable Protein Design: Advancing the Frontier Through Atomistic SE(3)-Equivariance with Genie 3

Lin, Y.; Lee, M.; Vermani, A.; Jiang, E.; De Cooman, S.; Spetko, M.; AlQuraishi, M.

2026-05-05 bioinformatics 10.64898/2026.05.01.722168 medRxiv
Top 0.2%
0.7%

Despite the breakneck pace of progress in protein design methodology, frontier problems remain challenging, with leading methods struggling to design high-affinity binders, scaffold multiple functional motifs, or stabilize large multi-domain proteins. Recent research efforts have focused on two areas: improving model reasoning when generating active sites or binding interfaces, and improving concordance between the design process and the in silico oracle used to select promising designs. In addressing the first, the field has shifted towards all-atom models that capture sidechain conformations in atomistic detail by eschewing data-efficient SE(3)-equivariance, mirroring the evolution of AlphaFold2 to AlphaFold3. In addressing the second, recent work has focused on replacing generative models employing diffusion or flow-matching with hallucination approaches that directly optimize the oracle in sequence space; this improves success rates but reduces computational efficiency. Here, we close and surpass the generation-hallucination gap by revisiting SE(3)-equivariance using a branched polymer treatment of protein structures. The resulting diffusion model, Genie 3, achieves state-of-the-art performance on binder design, motif scaffolding, and unconditional generation, while being significantly faster than the best existing methods. We use Genie 3 to design a nanomolar binder of Nipah Glycoprotein G, a tetramer with minimal structural or biophysical characterization, as part of the Adaptyv Bio Nipah Competition, achieving a 12.5% success rate. Taken together, our results present a new frontier in protein design capability and a reexamination of the role of SE(3)-equivariance in molecular modeling.

7
AI-derived Protein Structures Validation: AlphaFold2 Models in the Twilight Zone

Griffin, P.; Deganutti, G.; Jadeja, K.; Idigbe, C.; Pipito', L.; Mejuto, L.; Ng, C. P.; Peck, S.; Greaves, J.; Reynolds, C. A.

2026-05-12 bioinformatics 10.64898/2026.05.12.724499 medRxiv
Top 0.2%
0.7%

In any field, unquestioningly accepting artificial intelligence (AI) results should be considered bad practice. Here, we devised a comparative modelling-based strategy for validating protein structures that exploits the well-known observation that protein folds are far more conserved than protein sequences. We identify proteins with a similar fold to the AlphaFold-generated query protein and determine their structural alignment to the query. The hypothesis is that if the sequence alignment coincides with the structural alignment, then the structure is validated. The strategy is implemented on a helix-by-helix and strand-by-strand basis using a multi-template pairwise local profile alignment method that works well into the twilight zone. The method is illustrated by application to the transmembrane transporter PEPT1, for which the structure is known, and the S-deacylases ABHD13 and ABHD16A, for which only AI-generated models exist. ABHD16A is particularly challenging because a sequence alignment search with BLASTp does not reveal any structural homologues and therefore requires work with extremely remote homologues; however, both models are validated through this strategy and are stable during classical molecular dynamics simulations. The ability of the strategy to identify errors is assessed with reference to misaligned ABHD13 models and misfolded decoy proteins.

8
Pathway Representation via Intrinsic Structural Medoids (PRISM): A Structural Mapping Approach to Clustering Molecular Pathways

Brylle Woody Santos, J.; Leung, J.; Chong, L.; Miranda Quintana, R. A.

2026-05-19 biophysics 10.64898/2026.05.16.725628 medRxiv
Top 0.2%
0.7%

We present Pathway Representation via Intrinsic Structural Medoids (PRISM), a state-aware framework for clustering pathways from molecular dynamics simulations of biomolecular transitions. In PRISM, each pathway is mapped to a small set of structural medoids obtained via a deterministic k-means clustering scheme. Pairwise pathway dissimilarities are computed using a weighted average Hausdorff distance between these representative sets, effectively capturing mean nearest-neighbor structural deviations while reducing sensitivity to outliers. Hierarchical agglomerative clustering of the resulting dissimilarity matrix defines pathway families. We evaluate PRISM across three biomolecular transitions of increasing complexity: alanine dipeptide C7eq → C7ax isomerization, adenylate kinase opening, and HIF-2 PAS-B ligand unbinding. PRISM consistently yields robust cluster assignments, with medoids faithfully representing distinct conformational states. By combining a state-based description with robust geometric dissimilarities, PRISM provides a scalable framework for organizing complex transition pathways.
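
The dissimilarity described in this abstract can be illustrated with a minimal sketch of a weighted average Hausdorff distance between two sets of representative points. This is not the PRISM implementation: the abstract does not state how the weights are chosen or which structural metric is used, so the optional weights (for example, medoid cluster populations), the Euclidean metric on plain coordinate tuples, and the function names below are all assumptions made for illustration.

```python
import math


def directed_average(src, dst, weights=None):
    """Weighted mean, over points in src, of the distance to the nearest point in dst.

    Points are tuples of floats. The weights are a hypothetical choice
    (e.g. cluster populations); uniform weights are used when none are given.
    """
    if weights is None:
        weights = [1.0] * len(src)
    total = sum(
        w * min(math.dist(p, q) for q in dst)
        for p, w in zip(src, weights)
    )
    return total / sum(weights)


def average_hausdorff(a, b, weights_a=None, weights_b=None):
    """Symmetric average Hausdorff distance: mean of the two directed averages."""
    return 0.5 * (
        directed_average(a, b, weights_a) + directed_average(b, a, weights_b)
    )


# Toy "medoid sets" for two pathways, as points in a 2D feature space.
pathway_a = [(0.0, 0.0), (2.0, 0.0)]
pathway_b = [(0.0, 0.0)]
d_ab = average_hausdorff(pathway_a, pathway_b)
```

Because each directed term averages nearest-neighbor distances instead of taking their maximum, a single outlying medoid shifts the result only in proportion to its weight, which is the reduced outlier sensitivity the abstract refers to.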

9
Mantis-Delta: Mass-Action Network Theory and Steady-State Characterization for Chemical Reaction Networks

Venegas Hernandez, E. A.

2026-05-18 bioinformatics 10.64898/2026.05.14.725189 medRxiv
Top 0.2%
0.7%

Chemical Reaction Network Theory (CRNT), developed by Horn, Jackson, and Feinberg, provides parameter-free structural theorems that constrain the asymptotic dynamics of mass-action systems irrespective of the numerical values of the rate constants. Despite the maturity of the theory, modern open-source implementations that combine CRNT structural analysis with symbolic ordinary differential equation (ODE) construction and robust numerical steady-state finding remain scarce. We present mantis-delta, a pure Python library that ingests human-readable reaction strings, builds the complex reaction graph, computes the deficiency δ = n - ℓ - s and weak reversibility, and decides applicability of the Deficiency Zero Theorem (DZT) and Deficiency One Theorem (D1T). For systems satisfying these structural conditions, mantis-delta certifies, without any simulation whatsoever, existence, uniqueness and (for DZT) asymptotic stability of the positive steady state in every stoichiometric compatibility class. When the structural theorems do not apply, the library provides symbolic mass-action ODEs and Jacobians via SymPy and a hybrid numerical solver that combines stiff implicit integration with bound-constrained algebraic least-squares to locate both stable and unstable fixed points, including Hopf bifurcation centres inaccessible to forward integration. We demonstrate the workflow on six benchmarks: a reversible isomerisation, the Michaelis-Menten enzyme mechanism, the closed and chemostatted Brusselator, a catalytic hairpin assembly (CHA) miR-21 biosensor, and the Goldbeter-Koshland zero-order ultrasensitivity switch. In each case, the CRNT-predicted qualitative behaviour (monostability, oscillation, uniqueness) is recovered numerically with a residual below 10^-6 M s^-1, and the Goldbeter-Koshland dose-response curve agrees with the closed-form quasi-steady-state approximation to within 1% over a 400× kinase/phosphatase activity scan.
mantis-delta is open-source (MIT license) and available at https://github.com/emiliovenegas/mantis-delta.
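
To make the deficiency formula concrete, here is a minimal standard-library sketch that computes δ = n - ℓ - s from reaction strings, where n is the number of distinct complexes, ℓ the number of linkage classes, and s the rank of the stoichiometric matrix. It does not use or reproduce the mantis-delta API: the function names, the restriction to one-directional `->` arrows (a reversible step is written as two reactions), and the use of `0` for the empty complex are assumptions of this sketch.

```python
import re
from fractions import Fraction


def parse_complex(text):
    """Turn 'A + 2B' into a canonical tuple of (species, coefficient) pairs."""
    counts = {}
    for term in text.split("+"):
        term = term.strip()
        if term in ("", "0"):  # '0' denotes the empty complex
            continue
        match = re.fullmatch(r"(\d*)\s*([A-Za-z_]\w*)", term)
        if match is None:
            raise ValueError(f"cannot parse term {term!r}")
        coeff = int(match.group(1)) if match.group(1) else 1
        counts[match.group(2)] = counts.get(match.group(2), 0) + coeff
    return tuple(sorted(counts.items()))


def matrix_rank(rows):
    """Rank of an integer matrix by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, len(rows)):
            factor = rows[i][col] / rows[rank][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def deficiency(reactions):
    """Return delta = n - ell - s for a list of 'reactants -> products' strings."""
    edges = [tuple(parse_complex(side) for side in rxn.split("->")) for rxn in reactions]
    complexes = sorted({c for edge in edges for c in edge})
    species = sorted({name for c in complexes for name, _ in c})

    # Linkage classes are the connected components of the complex graph (union-find).
    parent = {c: c for c in complexes}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for source, target in edges:
        parent[find(source)] = find(target)

    n = len(complexes)
    ell = len({find(c) for c in complexes})
    # One reaction vector (products minus reactants) per reaction.
    vectors = [
        [dict(target).get(sp, 0) - dict(source).get(sp, 0) for sp in species]
        for source, target in edges
    ]
    s = matrix_rank(vectors)
    return n - ell - s


michaelis_menten = ["E + S -> ES", "ES -> E + S", "ES -> E + P"]
lotka_volterra = ["A -> 2A", "A + B -> 2B", "B -> 0"]
```

For the Michaelis-Menten mechanism the counts are n = 3, ℓ = 1, s = 2, so δ = 0; for the Lotka-Volterra network they are n = 6, ℓ = 3, s = 2, so δ = 1. A deficiency of zero alone is not enough for the Deficiency Zero Theorem, which also requires weak reversibility; that check is not included here.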

10
Deep Learning Structural Ensembles as Proxies for Protein Flexibility

Tunc, M. T.; Dizkirici Tekpinar, A.; Tekpinar, M.

2026-05-18 bioinformatics 10.64898/2026.05.16.725658 medRxiv
Top 0.3%
0.6%

Protein dynamics are essential to biological function, yet understanding whether deep learning models contain information about these dynamics remains an open question. In this study, we quantitatively investigate the capacity of deep learning structure generation methods to predict protein flexibilities by directly comparing residue-level mean squared fluctuation (MSF) profiles derived from structural ensembles with experimental or simulation-informed flexibility profiles. We assembled four diverse benchmark datasets representing different types of structural information, including 70 NMR ensembles, 43 X-ray crystallographic protein pairs in two distinct conformational states, 82 high-resolution cryo-EM structures, and molecular dynamics simulations of 10 proteins. Utilizing AlphaFold3, AlphaFold2, and RosettaFold to generate multiple structural models, we applied ranksort normalization to place the profiles on a comparable scale and quantified similarity primarily using cosine and Pearson similarities. Our results demonstrate that the flexibility predictions from deep learning-generated models agree well with experimental data, suggesting that fluctuations in these predicted ensembles can serve as effective proxies for protein flexibility. Notably, AlphaFold3 consistently produced the best results across the datasets. We also observed that flexibility prediction accuracy generally improves as the number of models increases up to 15, and our findings remain robust even when terminal residues are excluded from the analysis. To facilitate broader application, we provide three publicly accessible Jupyter Notebooks to calculate MSF from deep learning outputs. Ultimately, this work provides evidence that deep learning structural ensembles can serve as proxies for protein flexibility.
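
The core comparison in this abstract, a per-residue mean squared fluctuation (MSF) profile from an ensemble of models scored against a reference profile by cosine similarity, can be sketched as follows. This is a minimal illustration and not the authors' notebooks: it assumes one representative atom per residue, assumes the models are already superposed, and omits the ranksort normalization step because the abstract does not define it. The function names are hypothetical.

```python
import math


def mean_squared_fluctuation(ensemble):
    """Per-residue MSF from an ensemble of pre-superposed models.

    ensemble[m][i] is the (x, y, z) tuple of residue i in model m.
    """
    n_models = len(ensemble)
    n_residues = len(ensemble[0])
    profile = []
    for i in range(n_residues):
        # Average position of residue i over all models.
        mean_pos = tuple(
            sum(model[i][k] for model in ensemble) / n_models for k in range(3)
        )
        # Mean squared displacement from that average position.
        msf_i = sum(
            math.dist(tuple(model[i]), mean_pos) ** 2 for model in ensemble
        ) / n_models
        profile.append(msf_i)
    return profile


def cosine_similarity(u, v):
    """Cosine of the angle between two flexibility profiles of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Two toy models of a two-residue chain: residue 0 is rigid, residue 1 moves.
toy_ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
]
toy_profile = mean_squared_fluctuation(toy_ensemble)
```

In practice the predicted profile would be compared with one derived from NMR ensembles, crystallographic pairs, or simulation, and the superposition step matters: without it, rigid-body differences between models inflate every residue's MSF.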

11
Reduced-Precision Stochastic Simulation For Mathematical Biology

Kimpson, T.; Flegg, M. B.; Flegg, J. A.

2026-05-06 systems biology 10.64898/2026.05.01.722176 medRxiv
Top 0.3%
0.5%

The stochastic simulation algorithm (SSA) is widely used to perform exact forward simulation of discrete stochastic processes in biology. However, its computational cost, driven by sequential event-by-event sampling across large ensembles, remains a barrier. We investigate whether reduced-precision floating-point arithmetic can accelerate SSA without degrading statistical fidelity, drawing on the success of reduced-precision methods in weather and climate modelling. We evaluate two strategies across five canonical models (birth-death, Schlogl, Telegraph, dimerisation, repressilator): (i) mixed precision, computing propensities in 16-bit while maintaining accumulators in 32-bit; and (ii) uniform precision, performing all arithmetic in 16-bit. Mixed-precision SSA produces ensemble statistics that closely match the 64-bit reference for all models, as measured by Kolmogorov-Smirnov tests and Wasserstein distances. Under uniform precision, deterministic rounding introduces systematic biases across several models, with catastrophic failures in some cases. Stochastic rounding (SR) and propensity normalisation eliminate these biases, restoring distributional fidelity across all models tested (KS p > 0.05). Our results establish mixed-precision SSA with SR as a viable acceleration strategy for mathematical biology: 16-bit formats shrink per-variable data size by 2-4× relative to fp32/fp64, yielding comparable reductions in memory footprint and up to ~1.5× wall-clock speedup on CPU hardware that lacks native 16-bit arithmetic. As a hardware-level acceleration, mixed-precision SSA complements algorithmic methods such as tau-leaping and maps naturally onto modern GPU and TPU architectures with native 16-bit arithmetic.
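
The mixed-precision strategy (i) can be emulated in plain Python for the birth-death model, as a rough sketch of the idea and not the authors' code. Propensities are rounded to IEEE half precision and the time and total-propensity accumulators to single precision, using `struct` round-trips; this reproduces the rounding behaviour only, not any speedup. The deterministic round-to-nearest used here is not the stochastic rounding discussed in the abstract, and the rate constants are arbitrary illustrative values.

```python
import random
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def birth_death_ssa(k_birth, k_death, n0, t_end, rng):
    """Gillespie SSA for 0 -> X (rate k_birth) and X -> 0 (rate k_death * n).

    Propensities are held in emulated 16-bit precision; the running time and
    the total propensity are held in emulated 32-bit precision.
    Returns the copy number at time t_end.
    """
    t, n = 0.0, n0
    while True:
        a_birth = to_fp16(k_birth)
        a_death = to_fp16(k_death * n)
        a_total = to_fp32(a_birth + a_death)
        if a_total == 0.0:
            break  # no reaction can fire
        t = to_fp32(t + rng.expovariate(a_total))
        if t > t_end:
            break
        # Choose which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_birth:
            n += 1
        else:
            n -= 1
    return n


rng = random.Random(0)
samples = [birth_death_ssa(10.0, 0.1, 0, 50.0, rng) for _ in range(100)]
ensemble_mean = sum(samples) / len(samples)
```

The exact stationary distribution of this model is Poisson with mean k_birth / k_death (100 for the values above), so the ensemble mean gives a quick sanity check on whether the reduced-precision propensities have biased the result.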

12
Beyond Redfield: Thermodynamic Bounds and Non-Perturbative Quantum Dynamics in Tubulin Networks

Firmenich, F.; Firmenich, P.; Firmenich, L.

2026-05-13 biophysics 10.64898/2026.05.10.724047 medRxiv
Top 0.4%
0.4%

Quantum effects in biology are unavoidable at the molecular scale; the unresolved question is whether they can remain functionally relevant across the timescale gap between femtosecond molecular dynamics and microsecond-to-millisecond biological function. Here we formalize this mismatch as an equilibrium-to-functionality gap and use tubulin as a stringent open-system test case. We combine secular Lindblad, Redfield, and hierarchical equations of motion (HEOM) treatments to quantify decoherence, non-perturbative relaxation, and the physical amplification required for functional relevance. Equilibrium dephasing yields a conservative [Formula] fs at 310 K, with a generic protein-bath baseline of ≈ 13 fs. A completed 30 ps HEOM trajectory for the full 1JFF tryptophan network shows distributed non-Markovian relaxation, with terminal purity Pur = 0.210 and stretched-exponential exponent β_KWW ≈ 0.44, confirming that Redfield is useful as a short-time perturbative comparator but not quantitatively interchangeable with HEOM in this intermediate-coupling regime. We introduce a coherence-utility criterion U = K τ_coh/τ_func, separating required amplification from empirically bounded gain. A thermodynamic uncertainty relation closure shows that neural-scale cascade amplification would require P_min ~ 10^-7 W, about five orders of magnitude above the local microtubule GTP budget. Frohlich pumping is found to be linewidth-gated rather than generically micron-scale; ordered-water cavity QED and geometric subradiance remain experimentally testable but severely constrained candidates. The result is not a model of consciousness, but a reproducible physical benchmark framework for evaluating biological quantum-coherence claims under explicit open-system, energetic, and experimental constraints.
Six falsifiable experimental programmes are prioritized, and the full computational framework is released with a validation ledger, cryptographic audit trail, and living supplementary material.
Graphical abstract: Equilibrium tubulin coherence lies in the femtosecond regime, while functional neural timescales lie in the millisecond regime. Frohlich pumping, QED-cavity protection, and geometric subradiance remain experimentally discriminable non-equilibrium candidates requiring independently bounded amplification.
Funding: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Versioned computational scope of this release: This manuscript reports the theoretical framework, calibrated equilibrium baseline, Redfield/HEOM validation ledger, stratified Bayesian evidence synthesis, classical comparators, and falsifiable experimental design. The release-specific reproduction audit, including the current validation-check total and the SHA-256 fingerprints of the binary production artefacts (.npz, .pkl), is documented in LIVING_SI.md and outputs_data/raw_json/structural/validation_report.json. A completed 30 ps HEOM production trajectory has been validated on constrained hardware; the master dataset contains the full 8-site population trajectory. A summary of those results is provided in §2.2.5. All claims made below are restricted to the numerical and theoretical evidence reported in this manuscript and its associated repository artefacts. The public repository ships the calibrated phenomenological baseline for accessibility; the HEOM production artefacts serve as the non-perturbative validation benchmark.
All source figure outputs associated with this release are maintained in the public repository under outputs_data/figures_final/.

13
On the applicability domain of HADDOCK3 for protein-aptamer docking: documented failure modes from a 5x7 cross-target screening matrix and a 1676 aa receptor case study (P01031)

Dohi, E.

2026-05-12 bioinformatics 10.64898/2026.05.11.724398 medRxiv
Top 0.4%
0.4%

We screened a 5 receptor × 7 aptamer = 35-cell cross-target matrix with HADDOCK3 [1] under blind ambiguous-interaction-restraint (AIR) protocols on AlphaFold-modelled receptors. The screen surfaced 12 operationally distinct failure modes (collapsing to ~8 conceptual classes; §3.1). The K_D-calibration subset is n = 4 cells with literature K_D records under matched assay conditions; the broader cohort includes ≥ 6 biological cognate or intended-cognate cells. The principal case study is P01031 (complement C5, 1676 aa, ≥ 12 structural domains): all 7 panel members produced positive HADDOCK3 top-1 scores under a scale-adaptive AIR. Score-term decomposition locates the anomaly in the AIR term (+217 to +268 to top-1 score). With AIR zeroed, scores fall to -131 to -74 -- the small-receptor regime. Boltz-2 cofolding chain-pair ipTM (cpi_AB) is an independent channel: P01031 shows the lowest median cpi_AB (0.211; 0/7 above the 0.5 confident-interface threshold). To our knowledge, this is the first reported case study of a 1676 aa multi-domain receptor exhibiting this signature under blind scale-adaptive AIR -- an n = 1 mechanistic case, not a statistical generalisation. We adapt the QSAR applicability domain concept [14-16] to in silico aptamer screening. §3.7 reports an empirical Mode 1 mitigation (pLDDT-aware AIR prefilter; cohort Jaccard recovery ~10×).

14
Simple baselines rival protein language models in mutation-dense design tasks

Talpir, I.; Fleishman, S. J.

2026-05-06 bioinformatics 10.64898/2026.05.01.722313 medRxiv
Top 0.4%
0.4%

Computational protein design demands generally applicable models that reliably predict or generate unmeasured variants with superior functional properties. Although protein language models (pLMs) have been used in zero-shot and transfer-learning design studies, they have generally not been assessed in benchmarks that explicitly test combinatorial extrapolation from lower- to higher-order variants. Here we benchmark widely used pLMs against conventional baseline methods in recently described dense, experimentally validated multi-mutant landscapes. We find that regardless of architecture and parameter count, pLMs are statistically similar to one another, and none consistently outperforms conventional baseline methods. Furthermore, their ability to distinguish functional from non-functional variants in zero-shot prediction is comparable to that of conventional homology-based methods. We suggest that to contribute significantly to the design of protein function, pLMs may need to encode biophysical and structural priors or be combined with structure-based approaches.

15
Structural bias in machine learning-guided peptide design

Aldas-Bulos, V. D.; Plisson, F.

2026-05-08 bioinformatics 10.64898/2026.05.06.721805 medRxiv
Top 0.4%
0.3%

Machine learning continues to accelerate peptide and protein design through the rapid prediction and generation of sequences with desired characteristics. Many applications focus on predicting properties, functions, and structures, as well as generating point mutations and de novo designs. Nevertheless, many models prove less generalizable than initially claimed. Most predictors and generators are trained on sequential datasets, where imbalances can be addressed during preprocessing. In contrast, structural bias, a subtype of algorithmic bias arising from uneven representation of structural classes in training datasets, and the limitations of early protein structure predictors have frequently remained undetected and uncorrected. The recent surge in powerful protein structure prediction tools, such as the AlphaFold and RosettaFold series and their variants, now presents opportunities to mitigate this issue. We hypothesize that such structural sampling biases influence the downstream performance of ML models. Using antimicrobial peptides as a case study, we audited the structural biases in 16 state-of-the-art predictors for antimicrobial activity and tested whether structural information constrains their predictions. Our analysis revealed that models explicitly trained on sequential data still produce predictions biased by uneven fold representations and data leakage. These findings highlight the importance of integrating balanced structural data or implementing bias-mitigating strategies to develop agnostic models that maximize bioactive protein discovery and multi-objective optimization.

16
Efficient Stochastic Trace Generation for Transcription

Ferdowsi, A.; Fuegger, M.; Nowak, T.

2026-05-08 bioinformatics 10.64898/2026.05.05.722871 medRxiv
Top 0.5%
0.3%

Bursty transcription in single cells typically produces over-dispersed, skewed, and sometimes heavy-tailed expression distributions that are explained by two-state Markov models of the promoters. While the gold standard for simulation is exact stochastic sampling with Gillespie's algorithm, obtaining thousands of timed traces is computationally costly. Surrogate models based on stochastic differential equations (SDEs) are widely used to speed up this simulation process. An example is the Chemical Langevin Equation based on Gaussian noise, which, however, does not capture heavy-tailed noise. In this work, we present a unified SDE framework that combines deterministic drift, Gaussian fluctuations, and additive sporadic jumps of arbitrary distributions, and provide an open-source Python implementation, bcrnnoise. The framework subsumes standard surrogate models and allows for vectorized generation of batches of transcription traces. We assess computational speed and accuracy of common surrogate models along with new models, showing that high accuracy can be obtained while reducing computational cost by up to two orders of magnitude.
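
The three ingredients named in this abstract (deterministic drift, Gaussian fluctuations, and additive sporadic jumps) can be combined in a simple Euler-Maruyama loop, sketched below. This does not use or mirror the bcrnnoise API, and it generates one trace at a time instead of a vectorized batch. The function and parameter names, the constant noise amplitude, the Bernoulli approximation of jump arrivals (valid only when `jump_rate * dt` is much smaller than 1), and the clamping at zero are all assumptions of this sketch.

```python
import math
import random


def simulate_trace(x0, drift, sigma, jump_rate, jump_sampler, dt, n_steps, rng):
    """Euler-Maruyama integration of dX = drift(X) dt + sigma dW + J dN.

    jump_sampler(rng) draws one jump size J from an arbitrary distribution;
    jumps arrive at rate jump_rate, with at most one jump per time step.
    """
    x = x0
    trace = [x]
    for _ in range(n_steps):
        x += drift(x) * dt                                # deterministic drift
        x += sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)  # Gaussian fluctuation
        if rng.random() < jump_rate * dt:                 # sporadic jump
            x += jump_sampler(rng)
        x = max(x, 0.0)  # keep the expression level non-negative (sketch choice)
        trace.append(x)
    return trace


def production_decay(x):
    """Linear drift: constant production 5.0, first-order decay 0.5 * x."""
    return 5.0 - 0.5 * x


# Noise-free limit: the trace relaxes to the fixed point 5.0 / 0.5 = 10.
smooth = simulate_trace(
    0.0, production_decay, 0.0, 0.0, lambda r: 0.0, 0.01, 5000, random.Random(1)
)

# Bursty variant: heavy-tailed (Pareto-distributed) jumps on top of the same drift.
bursty = simulate_trace(
    0.0, production_decay, 0.5, 0.2, lambda r: r.paretovariate(2.5), 0.01, 5000,
    random.Random(2),
)
```

Setting the jump rate to zero leaves only the drift and Gaussian terms, which is the sense in which a jump-diffusion form subsumes a purely Gaussian surrogate.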

17
Benchmarking Boltz-2 for Screening of Therapeutic Antibody-Antigen Interactions

Fieux-Castagnet, A.; Waton, J.; Glukhonemykh, A.; Snow, E.; Ashokkumar, R.; Fleming, J.; Champagne, D.; Devenyns, T.; Peluffo, A.; Anagnostopoulos, C.

2026-05-14 bioinformatics 10.64898/2026.05.13.724924 medRxiv
Top 0.5%
0.3%

Protein structure prediction models (such as AlphaFold, Chai, and Boltz) have transformed structural biology and are increasingly explored for drug discovery; however, their utility for large-scale screening of antibody-antigen (AB-AG) interactions remains unclear, particularly for distinguishing true binding from non-binding pairs at scale. To our knowledge, there has not been an exhaustive exploration of Boltz-2 inference settings on this high-impact problem, and in this paper we describe and implement a novel benchmarking framework that can accelerate progress in the field. We evaluated Boltz-2 (NVIDIA NIM implementation) on 519 therapeutic monoclonal antibodies from Thera-SAbDab, pairing each antibody with its cognate target and a randomly assigned non-cognate antigen. We developed a novel evaluation framework that systematically captures variability across stochastic seeds while benchmarking different inference settings, including datasets with and without crystallographically resolved antibody structures. Across settings, Boltz-2-derived confidence metrics showed weak, though above-chance, discrimination (0.5 < ROC-AUC < 0.60). Among the evaluated metrics, the minimum value of the interface predicted TM-score across seed-samples (ipTM-min) captured the strongest signal. Interestingly, additional feature aggregation and multivariate modelling provided little to no improvement. Increasing the number of stochastic predictions yielded front-loaded gains, with diminishing returns beyond ~15-20 seed-samples, suggesting limited value of extensive sampling in practical workflows. Notably, inference without multiple sequence alignments (MSAs) slightly improved performance on non-crystallized antibodies (ΔAUROC ≈ +0.027) while reducing runtime by ~8 seconds per prediction compared to shallow MSA settings.
Overall, these results indicate that off-the-shelf confidence metrics from general-purpose structure prediction models may be insufficient for reliable target-antibody screening and highlight the need for task-specific optimization. They also confirm that modest amounts of sampling can be helpful but are not in themselves sufficient to improve performance significantly, as gains plateau relatively quickly.
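To make the ipTM-min metric concrete, the following is a minimal sketch of how a per-pair minimum over seed-samples could be turned into a ROC-AUC for cognate versus non-cognate pairs. It is not the authors' code: the function names, the antibody labels, and all score values are hypothetical toy data, and the AUC is computed with the plain pairwise (Mann-Whitney) definition rather than any particular library.

```python
from typing import Dict, List, Sequence


def min_iptm(seed_scores: Sequence[float]) -> float:
    """Aggregate one antibody-antigen pair: minimum ipTM over its seed-samples."""
    return min(seed_scores)


def roc_auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive scores above a
    random negative, with ties counted as one half (Mann-Whitney form)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical ipTM values, three seed-samples per pair (NOT real Boltz-2 output).
cognate: Dict[str, List[float]] = {
    "ab1": [0.62, 0.55, 0.71],
    "ab2": [0.40, 0.48, 0.45],
    "ab3": [0.80, 0.77, 0.69],
}
non_cognate: Dict[str, List[float]] = {
    "ab1": [0.50, 0.58, 0.61],
    "ab2": [0.42, 0.39, 0.47],
    "ab3": [0.66, 0.52, 0.60],
}

# One ipTM-min score per pair, then a single AUC over all pairs.
pos_scores = [min_iptm(v) for v in cognate.values()]
neg_scores = [min_iptm(v) for v in non_cognate.values()]
auc = roc_auc(pos_scores, neg_scores)
print(f"ipTM-min ROC-AUC on toy data: {auc:.3f}")
```

An AUC of 0.5 corresponds to chance, so the 0.5 to 0.60 range reported in the abstract means a cognate pair outranks a non-cognate one only slightly more often than not. The toy numbers here are chosen for readability and do not reflect that range.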

18
PDBe-SIFTS: an open-source tool for Structure Integration with Function, Taxonomy, and Sequences, featuring improved alignment, scoring scheme, and accelerated search

Bellaiche, A.; Choudhary, P.; Nair, S.; Harrus, D.; Yu, C. W.-H.; Tanweer, S. A.; Evans, G. L.; Lo, S. W.; Martin, M.; Fleming, J. R.; Velankar, S.

2026-05-04 bioinformatics 10.64898/2026.04.30.721839 medRxiv
Top 0.5%
0.3%

Structure Integration with Function, Taxonomy and Sequences (SIFTS) provides residue-level mappings between UniProt Knowledgebase sequences and Protein Data Bank structures and has historically been generated through internal Protein Data Bank in Europe (PDBe) pipelines. Here, PDBe-SIFTS is presented as a fully open-source, locally deployable implementation of this mapping framework. The pipeline combines fast, scalable sequence search using MMseqs2, an improved bounded scoring scheme for ranking candidate mappings, and residue-level mapping refinement based on backbone connectivity. PDBe-SIFTS is distributed as a Python package with command-line tools for 1) building a sequence search database, 2) identifying the best sequence-structure match, 3) one-to-one mapping at the residue level, and 4) generating SIFTS annotations in PDBx/mmCIF format. Benchmarking on the complete Protein Data Bank archive showed that MMseqs2 reduced archive-scale UniProtKB searches from hours with BLASTP to minutes, approximately 22-36 times faster, while curated mappings were recovered at top rank in 93.1% of cases. The remaining discrepancies mainly involved biologically ambiguous cases such as highly conserved proteins, chimeric constructs, or closely related orthologs. These results show that PDBe-SIFTS enables fast mapping, improving structural coherence in residue-level alignments while delivering the most up-to-date and accurate mappings, comparable to expert curation. Tool: https://github.com/PDBeurope/SIFTS Quick start notebook with example: https://github.com/PDBeurope/SIFTS/tree/master/notebooks
Broader audience statement: Matching protein sequences to their three-dimensional structures, and mapping annotations across both, is essential for understanding protein function, interactions, and molecular mechanisms. This integrated view enables richer interpretation of biological data and underpins advances in drug discovery, disease research, and protein engineering. PDBe-SIFTS provides an open and functional framework for structure-sequence mapping, allowing researchers and databases to run, inspect, and extend these mappings locally, while benefiting from faster searches, transparent scoring, and structurally informed residue-level alignments.

19
Coupled Binding and Folding of NS2B/NS3 Protease and Linker Effects Revealed by Topology-based Modeling

Dong, K.; Huang, J.; Chen, M.; Chen, J.

2026-05-07 biophysics 10.64898/2026.05.04.722635 medRxiv
Top 0.6%
0.3%

Orthoflaviviruses, such as West Nile virus (WNV), dengue virus (DENV), and Zika virus (ZIKV), are globally distributed pathogens that pose substantial threats to human health. There are still no effective antiviral drugs for WNV or ZIKV. Despite the availability of two licensed DENV vaccines, their use remains limited due to potential risks, highlighting an urgent need for antiviral drug development. The highly conserved orthoflavivirus protease NS2B/NS3 is required for viral replication, making it a promising anti-flavivirus target. A major challenge, however, is the highly charged active site of this enzyme, which requires charged chemical matter with low bioavailability. An alternative and more attractive strategy is to target potential allosteric sites or folding intermediate states of the protease. In this work, we employ topology-based coarse-grained Gō modeling to explore the coupled binding and folding pathways of the WNV NS2B/NS3 protease and to study how the widely used experimental construct, which has a G4SG4 linker between NS2B and NS3, affects stability and folding. Our results provide a holistic conformational landscape of protease binding and folding, including several key intermediate states. We find that the presence of the G4SG4 linker alters the folding pathways and destabilizes the NS2B C-terminus. The latter is consistent with experimental observations that the G4SG4-linked protease has lower activity and adopts an open state without the substrate in crystal structures. Together, these findings provide for the first time a complete picture of the binding and folding of the NS2B/NS3 protease and identify important folding intermediate states that could be targeted for allosteric antiviral drug development.

20
Generative Chemistry Platform for Small Molecules Targeting RNA: A Case Study for Chemical Optimization

Allen, T. E. H.; Bonnet, M.; Khan, R. T.

2026-05-12 bioinformatics 10.64898/2026.05.08.723908 medRxiv
Top 0.6%
0.2%

We introduce the Serna Bio GenAI platform, a generative chemistry and multiparametric optimization platform for the design of RNA-targeting small molecules. Targeting RNA with small molecules has historically proven challenging but offers notable potential upsides, including access to unique mechanisms of action and the ability to target otherwise untargetable genes. We consider a major challenge here to be designing chemistry specific to RNA targeting. Molecular design is a valuable application of AI in drug discovery, but many publicly available models use training data focused on protein targeting, the modality historically best explored in drug discovery. We showcase the difference and value in building a specifically RNA-targeting platform, comparing its performance to state-of-the-art public chemical generators and experimentally validating its chemical designs against chemistry designed by a human expert.